{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7-_XzuAwYuok"
      },
      "source": [
        "# Learn OpenAI Whisper - Chapter 9\n",
        "## Notebook 2 - PVS step 1: Processing audio files to LJSpeech format with Whisper and OZEN\n",
        "\n",
        "This notebook complements the book [Learn OpenAI Whisper](https://a.co/d/1p5k4Tg).\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wnomL0dxmU9CgPKIgazR8AocolEYjAe5)\n",
        "\n",
        "This notebook represents the initial step in the 3-step personalized voice synthesis (PVS) process outlined in the chapter. This step takes an audio sample of the target voice as input and processes it into the LJSpeech dataset format. The notebook demonstrates how to use the OZEN Toolkit and OpenAI's Whisper to extract speech, transcribe it, and organize the data according to the LJSpeech structure. The resulting LJSpeech-formatted dataset, consisting of segmented audio files and corresponding transcriptions, serves as the input for the second step, \"PVS step 2: Fine-tuning a discrete variational autoencoder using the DLAS toolkit,\" where a PVS model is fine-tuned using this dataset.\n",
        "\n",
        "This notebook is based on the [OZEN Toolkit](https://github.com/devilismyfriend/ozen-toolkit) project. Given a folder of files or a single audio file, it will extract the speech, transcribe using Whisper and save in the LJ format (segmented audio files in WAV format in `wavs` folder, transcriptions in folders `train` and `valid`).\n",
        "\n",
        "**NOTE**: The notebook stores the files using the following format.\n",
        "\n",
        "`dataset/`\n",
        "* ---├── `valid.txt`\n",
        "* ---├── `train.txt`\n",
        "* ---├── `wavs/`\n",
        "\n",
        "`wavs/` directory must contain `.wav` files.\n",
        "\n",
        "Example for `train.txt` and `valid.txt`:\n",
        "\n",
        "* `wavs/A.wav|Write the transcribed audio here.`\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MZ7hGRHw9GLy"
      },
      "source": [
        "## 1.\tCloning the OZEN Toolkit repository:\n",
        "\n",
        "The following command clones the OZEN Toolkit repository from GitHub, which contains the necessary scripts and utilities for processing audio files:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7n6I0nrcQj2F"
      },
      "outputs": [],
      "source": [
        "!git clone https://github.com/devilismyfriend/ozen-toolkit"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4rvuQoC_9NQr"
      },
      "source": [
        "## 2.\tInstalling required libraries\n",
        "\n",
        "These following commands install the necessary libraries for audio processing, speech recognition, and text formatting:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gsEWSIVHR5Er"
      },
      "outputs": [],
      "source": [
        "!pip -q install transformers\n",
        "!pip -q install huggingface\n",
        "!pip -q install pydub\n",
        "!pip -q install yt-dlp\n",
        "!pip -q install pyannote.audio"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "K7msTh8jL6jX"
      },
      "source": [
        "# RESTART SESSION\n",
        "\n",
        "After installing the dependencies, restarting the session is recommended to ensure the installed packages are properly initialized.\n",
        "\n",
        "In Google Colab, from the top menu, select `Runtime`, then `Restart session`.\n",
        "<img src=\"https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/Restart_the_runtime_600x102.png\" width=600>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bTqMldutL8-5"
      },
      "outputs": [],
      "source": [
        "!pip -q install colorama\n",
        "!pip -q install termcolor\n",
        "!pip -q install pyfiglet"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zouAX0o2-fB2"
      },
      "source": [
        "## 3.\tChanging the working directory\n",
        "\n",
        "The next command changes the working directory to the cloned ozen-toolkit directory:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4KjQB5N6-eDT"
      },
      "outputs": [],
      "source": [
        "%cd ozen-toolkit"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uNHAT18b-psU"
      },
      "source": [
        "## 4.\tDownloading a sample audio file\n",
        "\n",
        "If you do not have an audio file for synthesis, this command downloads a sample audio file from the specified URL for demonstration purposes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qilVdaLsQHEq"
      },
      "outputs": [],
      "source": [
        "# Download sample file\n",
        "!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IyKrUHwN_RdE"
      },
      "source": [
        "## 5.\tUploading custom audio files\n",
        "\n",
        "If you have your audio file, this code block allows users to upload their audio files to the Colab environment. It creates a directory in `/content/ozen-toolkit` to store the uploaded files and saves them in that directory:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VQgw3KeV8Yqb"
      },
      "outputs": [],
      "source": [
        "# CUSTOM_VOICE_NAME = \"custom\"\n",
        "\n",
        "import os\n",
        "from google.colab import files\n",
        "\n",
        "custom_voice_folder = \"./myaudiofile\"\n",
        "\n",
        "os.makedirs(custom_voice_folder, exist_ok=True)  # Create the directory if it doesn't exist\n",
        "\n",
        "for filename, file_data in files.upload().items():\n",
        "    with open(os.path.join(custom_voice_folder, filename), 'wb') as f:\n",
        "        f.write(file_data)\n",
        "\n",
        "%ls -l \"$PWD\"/{*,.*}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CWIANgl6B6_4"
      },
      "source": [
        "## 6.\tCreating a configuration file\n",
        "The following code section creates a configuration file named `config.ini` using the `configparser` library. It defines various settings such as the Hugging Face API key, Whisper model, device, diarization and segmentation models, validation ratio, and segmentation parameters:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MLHFTm15WatB"
      },
      "outputs": [],
      "source": [
        "import configparser\n",
        "\n",
        "# Create a new ConfigParser object\n",
        "config = configparser.ConfigParser()\n",
        "\n",
        "# Add the 'DEFAULT' section and set the options\n",
        "config['DEFAULT'] = {\n",
        "    'hf_token': '<YOUR_HF_TOKEN_HERE>',\n",
        "    'whisper_model': 'openai/whisper-medium',\n",
        "    'device': 'cuda',\n",
        "    'diaization_model': 'pyannote/speaker-diarization',\n",
        "    'segmentation_model': 'pyannote/segmentation',\n",
        "    'valid_ratio': '0.2',\n",
        "    'seg_onset': '0.7',\n",
        "    'seg_offset': '0.55',\n",
        "    'seg_min_duration': '2.0',\n",
        "    'seg_min_duration_off': '0.0'\n",
        "}\n",
        "\n",
        "# Write the configuration to a file\n",
        "with open('config.ini', 'w') as configfile:\n",
        "    config.write(configfile)\n",
        "\n",
        "# Print the contents of the file\n",
        "with open('config.ini', 'r') as configfile:\n",
        "    print(configfile.read())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-XGpcx7QKr7z"
      },
      "source": [
        "## 7.\tRunning the OZEN script\n",
        "\n",
        "This command runs the ozen.py script with the sample audio file as an argument (or the file you uploaded).\n",
        "\n",
        "# IMPORTANT:\n",
        "`ozen.py` requires Hugging Face's `pyannote/segmentation` model. This is a gated model; you MUST request access before attempting to run the next cell. Thankfully, getting access is relatively straightforward and fast.\n",
        "\n",
        "*   You must already have a Hugging Face account; if you do not have one, see the instructions in the notebook for chapter 3:  [LOAIW_ch03_working_with_audio_data_via_Hugging_Face.ipynb](https://colab.research.google.com/drive/1bIiGyv_YiTdq97a7KrowCceOrZlG2hXL#scrollTo=VCEKs-Y4wAYQ)\n",
        "*   Visit https://hf.co/pyannote/segmentation to accept the user conditions.\n",
        "\n",
        "<img src=\"https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/HF_pyanonote_sementation_gated_model.JPG\" width=600>\n",
        "\n",
        "The script processes the audio file, extracts speech, transcribes it using Whisper, and saves the output in the LJSpeech format. The script saves the DJ format files in a folder called `ozen-toolkit/output/<audio file name + timestamp>/`. Here is an example of the expected file structure:\n",
        "```\n",
        "ozen-toolkit/output/\n",
        "---├── Learn_OAI_Whisper_Sample_Audio01.mp3_2024_03_16-16_36/\n",
        "------------------├── valid.txt\n",
        "------------------├── train.txt\n",
        "------------------├── wavs/\n",
        "--------------------------├── 0.wav\n",
        "--------------------------├── 1.wav\n",
        "--------------------------├── 2.wav\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QEhKULyV9rT3"
      },
      "outputs": [],
      "source": [
        "!python ozen.py Learn_OAI_Whisper_Sample_Audio01.mp3"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Agpune8JjsQX"
      },
      "source": [
        "#### Mount Google Drive (To save trained checkpoints and to load the dataset from)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pec0vCcWPc9z"
      },
      "source": [
        "## 8.\tMounting Google Drive\n",
        "\n",
        "These lines mount the user's Google Drive to the Colab environment, allowing access to the drive for saving checkpoints and loading datasets:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xFt4umCqjlS0"
      },
      "outputs": [],
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/gdrive')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6IhYDEIzQzhT"
      },
      "source": [
        "## 9.\tCopying the output to Google Drive\n",
        "\n",
        "The following command copies the processed output files from the `ozen-toolkit/output` directory to your Google Drive.\n",
        "\n",
        "<img src=\"https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/ch09_2_Google_Colab_directory.JPG\" width=600>\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CDHLbbE7GgjQ"
      },
      "outputs": [],
      "source": [
        "%cp -r /content/ozen-toolkit/output/ /content/gdrive/MyDrive/"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zT66okSEUzV3"
      },
      "source": [
        "After running the cell, go to your Google Drive using a web browser, and you will see a directory called `output` with the DJ format dataset files in it.\n",
        "\n",
        "<img src=\"https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/ch09_2_Google_Drive_directory.JPG\" width=500>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lyoCGBH47_Jp"
      },
      "source": [
        "---\n",
        "With our audio data now converted to the LJSpeech format, we are well-prepared to embark on the following critical stage of the PVS journey: fine-tuning a VPS model using the powerful DLAS toolkit. The notebook [LOAIW_ch09_3_Fine-tuning_PVS_models_with_DLAS.ipynb](LOAIW_ch09_3_Fine-tuning_PVS_models_with_DLAS.ipynb) will cover that process in detail. By leveraging the DLAS toolkit's comprehensive features and the structured LJSpeech dataset, we can create a personalized voice model that captures the unique characteristics of our target voice with remarkable accuracy and naturalness."
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}